Patent Grant Introduction
A patent is a legal monopoly granted by a government in return for public disclosure of an invention. A granted patent gives the proprietor the right to prevent others from using the invention in the territory to which the patent applies. Patent analysis is a useful management tool for steering a company’s technology strategy and its product or service development process. Turning patent data into competitive intelligence allows companies to assess their current technical competitiveness, to forecast technological trends, and to plan for potential competition based on new technologies.
Objective
Patents affect many areas of a business: strategic planning, mergers and acquisitions, licensing opportunities, R&D management, human resources, competitive intelligence, business intelligence, and so on. The intent of this project is to explore the methods and techniques that are commonly used in patent analysis and to examine patent analysis as a tool for competitive intelligence.
Data Introduction
We use the counts of patent grants by state for 2011-2015. The dataset is taken from the U.S. Patent and Trademark Office, Patent Technology Monitoring Team (PTMT), and contains the following information:
-U.S. or Foreign
-Code
-State-Territory or Country
-First-Named Assignee
-Year(2011:2015)
Together with revenue figures collected separately, these fields are used to derive the growth rates of revenue, granted patents and patent applications.
Research questions:
What is the relationship between patent data and company revenue? How can we predict revenue by using patent data?
-Building a map, a pie chart and a bar chart to find a suitable geographical region, data scope, and scale for this research
-Collecting raw data
-Exploring and diagnosing the chosen data with box plots and scatter plots
-Cleaning the data: discarding NA values and omitting outliers
-Identifying potential predictors with a correlation matrix
-Building linear regression models or other fitted models
-Evaluating the models with statistical criteria
-Testing the model on held-out data (data from 2011-2015 is used as training data, and data from 2017 as testing data)
The following shows how the number of patents is distributed among the states.
The overview of patent ratio in different states:
The pie chart shows that California ranks first by a wide margin.
Looking into the top 10 patent holders in CA
Over time, the count of patent grants in each year is shown in a different colour.
Hypothesis: an enterprise’s revenue can be predicted by using patent data as an indicator of R&D strength.
Our patent data is collected from the USPTO and the revenue information is retrieved from https://www.macrotrends.net/ . The raw data contains only the counts of patents and applications and the amount of revenue for each year from 2011 to 2015. After pre-processing, we obtain the preliminary data, comprising growth rates for 2012-2015, for the following analysis.
We omit the University of California from our preliminary data frame because it is not a private business entity and we could not find its revenue. Hence, 9 companies are left in our data set.
| variable: | denotation: |
|---|---|
| rev_rate | the rate of revenue growth |
| pat_rate | the growth rate of granted patents |
| pat_total | the total of granted patents accumulated from 2011 |
| app_rate | the growth rate of patent applications |
| app_total | the total of patent applications accumulated from 2011 |
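The pre-processing behind these variables can be sketched as follows. This is a minimal illustration, not the original script: the data frame `raw`, the company name `EXAMPLE`, the helper `growth` and the counts are all made up, and the growth rate is assumed to be the year-over-year relative change.

```r
# Hypothetical yearly grant counts for one made-up company (not the report's data)
raw <- data.frame(
  company = rep("EXAMPLE", 3),
  Year    = 2011:2013,
  patents = c(100, 150, 120)
)
raw <- raw[order(raw$company, raw$Year), ]

# Assumed definition: (this year - last year) / last year; the first year has no rate
growth <- function(x) c(NA, diff(x) / head(x, -1))

# Compute both variables within each company
raw$pat_rate  <- ave(raw$patents, raw$company, FUN = growth)
raw$pat_total <- ave(raw$patents, raw$company, FUN = cumsum)  # accumulated from 2011

# The 2011 row has no growth rate, so the usable data starts in 2012
prelim <- raw[!is.na(raw$pat_rate), ]
print(prelim)
```

The same two `ave()` calls, applied to application counts and revenue, would give `app_rate`, `app_total` and `rev_rate` under the same assumption.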
| X | rev_rate | pat_rate | app_rate | app_total | pat_total | company | Year |
|---|---|---|---|---|---|---|---|
| 1 | 0.2783981 | 0.4616541 | 0.1440729 | 3527 | 1637 | QUALCOMM | 2012 |
| 2 | 0.4458147 | 0.5350734 | 0.4978084 | 3989 | 1554 | APPLE | 2012 |
| 3 | 0.2145891 | 1.3588957 | 0.5192225 | 5832 | 1095 | | 2012 |
| 4 | 0.0119863 | 0.0100784 | -0.0278970 | 1838 | 1795 | BROADCOM | 2012 |
| 5 | 0.0000000 | 0.0976059 | 0.1493931 | 2302 | 1139 | HP | 2012 |
| 6 | 0.0657828 | -0.0438931 | 0.0244233 | 1492 | 1025 | CISCO | 2012 |
| X | rev_rate | pat_rate | app_rate |
|---|---|---|---|
| Length:36 | Min. :-0.43622 | Min. :-0.21372 | Min. :-0.70880 |
| Class :character | 1st Qu.:-0.03815 | 1st Qu.:-0.03853 | 1st Qu.:-0.17151 |
| Mode :character | Median : 0.05763 | Median : 0.09785 | Median :-0.02499 |
| NA | Mean : 0.08289 | Mean : 0.15729 | Mean : 0.01498 |
| NA | 3rd Qu.: 0.14933 | 3rd Qu.: 0.23755 | 3rd Qu.: 0.14540 |
| NA | Max. : 0.69405 | Max. : 1.35890 | Max. : 0.88103 |
| app_total | pat_total | company | Year |
|---|---|---|---|
| Min. : 1492 | Min. : 634 | APPLE : 4 | 2012:9 |
| 1st Qu.: 3159 | 1st Qu.:1328 | BROADCOM: 4 | 2013:9 |
| Median : 4684 | Median :1950 | CISCO : 4 | 2014:9 |
| Mean : 7333 | Mean :2483 | GOOGLE : 4 | 2015:9 |
| 3rd Qu.: 9756 | 3rd Qu.:3010 | HP : 4 | NA |
| Max. :32676 | Max. :7484 | IBM : 4 | NA |
| NA | NA | (Other) :12 | NA |
In order to further analyze this data, we need to normalize pat_total and app_total.
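The report does not state which normalization was applied. As one possible sketch, assuming min-max scaling to the range 0 to 1 (a z-score via `scale()` would be another common choice), the two totals could be rescaled like this. The helper `min_max` and the data frame `demo` are hypothetical; the three values per column are simply the minimum, median and maximum from the summary above.

```r
# Assumption: min-max scaling; the original normalization method is not given
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Illustrative values taken from the summary table (min, median, max)
demo <- data.frame(
  pat_total = c(634, 1950, 7484),
  app_total = c(1492, 4684, 32676)
)

# Rescale only the two accumulated totals; growth rates are left as they are
demo_norm <- demo
cols <- c("pat_total", "app_total")
demo_norm[cols] <- lapply(demo[cols], min_max)
print(demo_norm)
```

Rescaling puts the totals on a scale comparable to the growth rates, which is what makes a shared box plot readable.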
After processing:
library(plotly)
# One box per variable, drawn on a shared axis after normalization
plot_ly(type = "box") %>%
  add_boxplot(y = top9_norm$rev_rate, name = "rev_rate") %>%
  add_boxplot(y = top9_norm$pat_rate, name = "pat_rate") %>%
  add_boxplot(y = top9_norm$pat_total, name = "pat_total") %>%
  add_boxplot(y = top9_norm$app_rate, name = "app_rate") %>%
  add_boxplot(y = top9_norm$app_total, name = "app_total")
After processing the preliminary data, we draw a correlation plot to examine the correlations among these variables. Because our goal is to build a model for predicting revenue growth, we focus on the middle row, which shows the correlation between rev_rate and the other variables. In this matrix, a larger circle indicates a stronger correlation. Hence, we pick pat_rate and pat_total as indicators to build our models.
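The numbers behind such a correlation plot can be obtained with base R's `cor()`. The sketch below uses simulated data, so the data frame `demo` and its values are assumptions and only the column names follow the report. A package such as corrplot could then draw the circle matrix from `corr`.

```r
# Simulated stand-in for the normalized data (values are made up)
set.seed(7)
n <- 36
demo <- data.frame(
  pat_rate  = rnorm(n),
  pat_total = runif(n),
  app_rate  = rnorm(n),
  app_total = runif(n)
)
demo$rev_rate <- 0.2 * demo$pat_rate + rnorm(n, sd = 0.5)

# Pearson correlation matrix, with rev_rate placed in the middle row
vars <- c("pat_rate", "pat_total", "rev_rate", "app_rate", "app_total")
corr <- cor(demo[, vars], use = "complete.obs")

# Correlation of rev_rate with every other variable, strongest first
rev_row <- corr["rev_rate", colnames(corr) != "rev_rate"]
ranked  <- sort(abs(rev_row), decreasing = TRUE)
print(round(ranked, 3))
```

Ranking by absolute value treats strong negative and strong positive correlations alike, which matches reading the plot by circle size.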
To learn more about these two variables, pat_rate and pat_total, we also draw scatter plots. The two plots show how each variable is distributed against rev_rate, and they reveal some outliers in both.
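library(ggplot2)
library(plotly)
# rev_rate against pat_rate, coloured by company
s1 <- ggplot(data = top9_norm, aes(x = pat_rate, y = rev_rate)) + geom_point(aes(color = company))
ggplotly(s1)
# rev_rate against pat_total, coloured by company
s2 <- ggplot(data = top9_norm, aes(x = pat_total, y = rev_rate)) + geom_point(aes(color = company))
ggplotly(s2)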
Before we start to build models, one more task remains: omitting outliers. The function below computes the limits we use to sieve out outliers, and the graph below demonstrates how the data changes afterwards.
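FindOutliers <- function(x){
  q <- quantile(x)                    # q[2] is Q1 and q[4] is Q3
  iqr <- q[4] - q[2]                  # IQR = Q3-Q1
  lower_limit <- q[2] - 1.5 * iqr     # Q1-1.5*IQR
  upper_limit <- q[4] + 1.5 * iqr     # Q3+1.5*IQR
  # Return c(lower, upper); local names avoid shadowing the built-in IQR() and list()
  limits <- unname(c(lower_limit, upper_limit))
  return(limits)
}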
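How the limits are then applied is not shown in the report. A minimal sketch of one way to do it follows, on made-up data: the helper `iqr_limits` restates the same 1.5 × IQR rule so that the block runs on its own, and `demo` is a hypothetical data frame.

```r
# Same 1.5 * IQR rule as above, restated so this block is self-contained
iqr_limits <- function(x) {
  q <- quantile(x, na.rm = TRUE)
  iqr <- q[4] - q[2]
  unname(c(q[2] - 1.5 * iqr, q[4] + 1.5 * iqr))
}

# Made-up data with one obvious outlier
demo <- data.frame(
  company  = paste0("C", 1:6),
  pat_rate = c(1, 2, 3, 4, 5, 100)
)

limits <- iqr_limits(demo$pat_rate)   # Q1 = 2.25, Q3 = 4.75, so limits are -1.5 and 8.5
keep   <- demo$pat_rate >= limits[1] & demo$pat_rate <= limits[2]
demo_clean <- demo[keep, ]            # drops the row with pat_rate = 100
print(demo_clean)
```

Filtering whole rows, rather than replacing values with NA, keeps each remaining observation's predictor and response paired for the regression.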
Now we have two candidates, pat_total and pat_rate. First, we use pat_total as a predictor to build a linear regression model for rev_rate.
lm_pat_total <- lm(rev_rate~pat_total, data=top9_cleanData2)
ggplot(top9_cleanData2, aes(pat_total, rev_rate)) +
ggtitle("Linear Regression")+
geom_point() +
geom_smooth(method = "lm")
After building a model, we check the following criteria.
–Q1.Is the model statistically significant? (F statistics)
–Q2.Are the coefficients significant? (t statistics and p-values)
–Q3.Is the model useful? (R2 value)
–Q4.Does the model fit the data well? (residual plots)
–Q5.Does the data satisfy the assumptions?
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.000212   0.029644  -0.007   0.9943
pat_total    0.229157   0.115993   1.976   0.0585 .
Residual standard error: 0.1006 on 27 degrees of freedom
Multiple R-squared: 0.1263, Adjusted R-squared: 0.09394
F-statistic: 3.903 on 1 and 27 DF, p-value: 0.0585
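The figures used for Q1 to Q3 can also be read from the fitted model directly instead of from the printed summary. The sketch below fits a model to simulated data (the data frame `demo` and its values are made up) and uses only base R accessors of `summary.lm`.

```r
# Simulated data standing in for the cleaned data set (values are made up)
set.seed(1)
demo <- data.frame(x = rnorm(30))
demo$y <- 0.2 * demo$x + rnorm(30, sd = 0.1)

fit <- lm(y ~ x, data = demo)
s   <- summary(fit)

# Q1: overall significance, p-value of the F statistic
f <- s$fstatistic
model_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Q2: coefficient table with estimates, standard errors, t values and p-values
coef_table <- s$coefficients

# Q3: share of variance explained
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared

print(coef_table)
print(c(model_p = unname(model_p), r2 = r2, adj_r2 = adj_r2))
```

With a single predictor the F test and the slope's t test are equivalent (F equals t squared), which is why the model p-value and the coefficient p-value in the output above are both 0.0585.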
par(mfrow=c(2,2))
plot(lm_pat_total)
The statistical criteria and residual plots reveal that “pat_total” is not good enough to be a predictor: the p-value of the model (0.0585) is above 0.05 and R-squared is only 0.1263. So next we turn to “pat_rate”.
Second, we use pat_rate as a predictor to build a linear regression model for rev_rate.
The model is fitted after omitting outliers.
A simple linear regression indicates a dependence between the patent growth rate and the revenue growth rate, as the correlation analysis suggested. The plot shows the linear model as follows.
lm_pat_rate <- lm(data = top9_cleanData,rev_rate~pat_rate)
ggplot(top9_cleanData, aes(pat_rate, rev_rate)) +
ggtitle("Linear Regression")+
geom_point() +
geom_smooth(method = "lm")
We also check the following criteria.
–Q1.Is the model statistically significant? (F statistics)
–Q2.Are the coefficients significant? (t statistics and p-values)
–Q3.Is the model useful? (R2 value)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.01590    0.02007   0.792   0.4348
pat_rate     0.20323    0.09259   2.195   0.0366 *
Residual standard error: 0.09379 on 28 degrees of freedom
Multiple R-squared: 0.1468, Adjusted R-squared: 0.1163
F-statistic: 4.818 on 1 and 28 DF, p-value: 0.03663
–Q5.Does the data satisfy the assumptions?
Normality
shapiro.test(lm_pat_rate$residual)
##
## Shapiro-Wilk normality test
##
## data: lm_pat_rate$residual
## W = 0.93497, p-value = 0.06663
Independence
durbinWatsonTest(lm_pat_rate)
## lag Autocorrelation D-W Statistic p-value
## 1 0.003825231 1.834223 0.616
## Alternative hypothesis: rho != 0
A test statistic in the range of 1.5 to 2.5 is generally considered normal, that is, it shows no serious autocorrelation in the residuals.
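For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. It lies between 0 and 4, and values near 2 indicate no first-order autocorrelation. A minimal base R sketch on simulated data (the data frame `demo` and the model `fit` are made up) computes it by hand:

```r
# Simulated data; the point is the formula, not the numbers
set.seed(42)
demo <- data.frame(x = rnorm(30))
demo$y <- 0.2 * demo$x + rnorm(30, sd = 0.1)
fit <- lm(y ~ x, data = demo)

# d = sum((e_t - e_{t-1})^2) / sum(e_t^2), with residuals in observation order
e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)
print(dw)
```

The statistic depends on the order of the rows, so it is only meaningful when the observations are sorted in a sensible sequence such as time.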
Homogeneity of Variance
ncvTest(lm_pat_rate)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.0158844, Df = 1, p = 0.89971
For all three tests the p-values are larger than 0.05, so none of the assumptions is rejected.
–Q4.Does the model fit the data well? (residual plots)
The statistical criteria and plots show that this model is acceptable and there is no strong evidence to reject our hypothesis.
Our training data covers 2011-2015, so we use later data as testing data. After collecting the patent growth rate and the actual revenue for 2017, we run a t-test to compare the predictions made by our model with the actual values.
Comparing our prediction with actual revenue growth:
predict_rev = data.frame(lm_predict = predict(lm_pat_rate,newdata = test_data),
pat_rate =test_data$pat_rate,
rev_rate = testing$rev_rate)
t.test(predict_rev$rev_rate, predict_rev$lm_predict)
##
## Welch Two Sample t-test
##
## data: predict_rev$rev_rate and predict_rev$lm_predict
## t = 0.46001, df = 6.5616, p-value = 0.6604
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07396614 0.10909785
## sample estimates:
## mean of x mean of y
## 0.03039091 0.01282505
p1 = ggplot(top9_cleanData, aes(pat_rate, rev_rate)) +
ggtitle("Prediction for 2017 Revenue Growth Rate-LM")+
geom_point() +
geom_smooth(method = "lm")+
geom_point(data=testing, colour="red", size=3)
ggplotly(p1)
MSE_LR = mean((testing$rev_rate - predict_rev$lm_predict)^2)
MSE_LR
## [1] 0.00992566
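The MSE is in squared units, so its square root (RMSE) is easier to compare with the growth rates themselves. Because each prediction belongs to the same company as one actual value, a paired t-test is another possible way to compare the two columns. This is a suggestion, not what the report ran. The sketch below uses made-up numbers; `actual` and `predicted` are hypothetical vectors.

```r
# Hypothetical actual and predicted growth rates for four companies
actual    <- c(0.05, -0.02, 0.10, 0.01)
predicted <- c(0.03,  0.01, 0.06, 0.02)

err  <- actual - predicted
mse  <- mean(err^2)      # same formula as MSE_LR above
rmse <- sqrt(mse)        # back on the scale of the growth rate
mae  <- mean(abs(err))   # less sensitive to a single large miss

# Paired comparison: tests whether the mean per-company error differs from 0
paired <- t.test(actual, predicted, paired = TRUE)

print(c(mse = mse, rmse = rmse, mae = mae))
print(paired$p.value)

# RMSE implied by the MSE reported above
print(sqrt(0.00992566))
```

A non-significant t-test only says the average error is not clearly different from zero; RMSE and MAE describe how large the individual errors are.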
The total and the growth rate of applications have no significant impact on revenue growth.
The total and the growth rate of patent grants have a stronger influence on revenue growth.
Furthermore, comparing the two patent grant variables, the influence of the growth rate is more evident than that of the total.
The scope of this research is limited. Only the top 9 companies in the patent grant ranking are investigated.
The number of observations is not sufficient. More data is required to make the models more convincing.
The observation period is limited. Only data before 2017 is collected and analyzed.